258 ◾ Bioinformatics
the centroid sequence (C) and skew of its abundance ($a_M$) with respect to the centroid
sequence abundance ($a_C$), which is given as
$$\mathrm{skew}(M, C) = \frac{a_M}{a_C} \tag{7.3}$$
When a member unique sequence has both a sufficiently small distance and a sufficiently small
skew with respect to the centroid sequence, it is likely an incorrect read
of the centroid sequence containing d point errors. The maximum skew (β) allowed for a cluster
member with d differences from the centroid sequence is given by
$$\beta(d) = \frac{1}{2^{\alpha d + 1}} \tag{7.4}$$
where α is set to 2 by default.
We can notice that as the distance d between the member sequence and centroid
increases, the maximum skew β decreases exponentially.
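As a minimal sketch of Equations 7.3 and 7.4, the following Python functions compute the skew and the maximum allowed skew and then flag a member sequence as a probable error of the centroid. The function names and the use of "less than or equal to" as the comparison are illustrative assumptions; they are not taken from the UNOISE2 implementation.

```python
def skew(a_m: float, a_c: float) -> float:
    """Equation 7.3: abundance of member M relative to centroid C."""
    return a_m / a_c


def max_skew(d: int, alpha: float = 2.0) -> float:
    """Equation 7.4: maximum skew allowed for a member with d differences."""
    return 1.0 / 2 ** (alpha * d + 1)


def is_probable_error(a_m: float, a_c: float, d: int, alpha: float = 2.0) -> bool:
    """Assumed decision rule: M is treated as a noisy read of C when its
    skew does not exceed the maximum skew for its distance d."""
    return skew(a_m, a_c) <= max_skew(d, alpha)


# With the default alpha = 2, the threshold shrinks by a factor of 4 per difference.
thresholds = [max_skew(d) for d in (1, 2, 3)]
```

With α = 2, β(1) = 1/8, β(2) = 1/32, and β(3) = 1/128, so a member with abundance 10 at distance 1 from a centroid with abundance 1000 (skew 0.01) would be flagged, whereas a member with abundance 200 (skew 0.2) would not.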
The unique sequences with low abundance are removed by the UNOISE2 algorithm.
The final products of any of the clustering and denoising methods are a feature table and a
list of representative sequences. The feature table provides the feature abundance, that is, the
number of times a feature has been observed in a sample. A feature is a unit of observation
that can be an OTU or an ASV. The feature table is needed for downstream analyses
such as taxonomy assignment, construction of a phylogenetic tree, and diversity analysis.
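To make the structure of a feature table concrete, the sketch below counts how many times each feature is observed in each sample. The input format (a list of sample and feature identifier pairs), the function name, and the identifiers are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_feature_table(observations):
    """Count observations of each feature per sample.

    observations: iterable of (sample_id, feature_id) pairs, one pair per read
    assigned to a feature (an OTU or an ASV).
    Returns {feature_id: {sample_id: count}}.
    """
    table = defaultdict(Counter)
    for sample_id, feature_id in observations:
        table[feature_id][sample_id] += 1
    return {feature: dict(counts) for feature, counts in table.items()}


# Hypothetical reads: sample S1 contains ASV1 twice; S2 contains ASV1 and ASV2 once each.
observations = [("S1", "ASV1"), ("S1", "ASV1"), ("S2", "ASV1"), ("S2", "ASV2")]
feature_table = build_feature_table(observations)
```

Each row of the resulting table corresponds to a feature and each column to a sample; a sample in which a feature was never observed is simply absent from that feature's row in this sketch.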
7.2.3 Taxonomy Assignment
Given a set of representative sequences generated by the above-discussed clustering or
denoising methods, the taxonomy assignment step will attempt to assign taxa for each
sequence. There are several methods for assigning taxonomy but, in general, they can
be categorized into (i) alignment-based methods such as BLAST and VSEARCH and (ii)
machine learning methods such as the Ribosomal Database Project (RDP) Classifier. The
output of a taxonomy assignment method is a mapping from each representative sequence to a taxon,
together with an assignment quality score.
7.2.3.1 Basic Local Alignment Search Tool
The Basic Local Alignment Search Tool, or BLAST [11], is a widely used seed-based heuristic
sequence search tool whose algorithm is adapted from the Smith-Waterman algorithm
for local sequence alignment. A representative sequence (generated from clustering
or denoising) is provided to BLAST as a search query, and the search is conducted against a
database of sequences with known taxonomy. Rather than relying on the alignment to a single
sequence, the taxonomy assignment is based on the consensus of the hits in the reference database
that exceed a predetermined percent identity. If the BLAST hits agree on the same taxonomy, the
representative sequence is assigned that taxonomy down to the level at which the consensus
among the hits is greater than a threshold.
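The consensus step described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular tool: the hit format (a lineage list paired with a percent identity), the function name, and the default thresholds `min_identity=97.0` and `min_consensus=0.51` are all hypothetical.

```python
from collections import Counter


def consensus_taxonomy(hits, min_identity=97.0, min_consensus=0.51):
    """Assign a lineage from the consensus of database hits.

    hits: list of (lineage, percent_identity), where lineage is a list of taxon
    names ordered from the highest rank downward.
    Returns (assigned_lineage, consensus_fraction_at_deepest_assigned_level).
    """
    # Keep only the hits that meet the percent identity threshold.
    lineages = [lineage for lineage, identity in hits if identity >= min_identity]
    if not lineages:
        return [], 0.0

    assigned, support = [], 0.0
    for level in range(max(len(lineage) for lineage in lineages)):
        # Count full lineage prefixes down to this level among the retained hits.
        prefixes = Counter(
            tuple(lineage[: level + 1]) for lineage in lineages if len(lineage) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / len(lineages)
        if fraction < min_consensus:
            break  # the hits no longer agree strongly enough at this level
        assigned, support = list(prefix), fraction
    return assigned, support


# Hypothetical hits for one representative sequence.
hits = [
    (["Bacteria", "Firmicutes", "Bacilli"], 99.0),
    (["Bacteria", "Firmicutes", "Clostridia"], 98.5),
    (["Bacteria", "Firmicutes", "Bacilli"], 97.2),
    (["Bacteria", "Proteobacteria", "Gammaproteobacteria"], 90.0),
]
lineage, support = consensus_taxonomy(hits)
strict_lineage, strict_support = consensus_taxonomy(hits, min_consensus=0.9)
```

In this example the fourth hit is discarded for falling below the identity threshold. Two of the three remaining hits agree on Bacilli, so the default threshold assigns the full lineage, while a stricter consensus threshold of 0.9 stops the assignment at Firmicutes.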